The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants and only 50% of the participants performed ensembling based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
translated by 谷歌翻译
Modern deep networks can be better generalized when trained with noisy samples and regularization techniques. Mixup and CutMix have been proven to be effective for data augmentation to help avoid overfitting. Previous Mixup-based methods linearly combine images and labels to generate additional training data. However, this is problematic if the object does not occupy the whole image as we demonstrate in Figure 1. Correctly assigning the label weights is hard even for human beings and there is no clear criterion to measure it. To tackle this problem, in this paper, we propose LUMix, which models such uncertainty by adding label perturbation during training. LUMix is simple as it can be implemented in just a few lines of code and can be universally applied to any deep networks \eg CNNs and Vision Transformers, with minimal computational cost. Extensive experiments show that our LUMix can consistently boost the performance for networks with a wide range of diversity and capacity on ImageNet, \eg $+0.7\%$ for a small model DeiT-S and $+0.6\%$ for a large variant XCiT-L. We also demonstrate that LUMix can lead to better robustness when evaluated on ImageNet-O and ImageNet-A. The source code can be found \href{https://github.com/kevin-ssy/LUMix}{here}
translated by 谷歌翻译
Speech representation learning has improved both speech understanding and speech synthesis tasks for single language. However, its ability in cross-lingual scenarios has not been explored. In this paper, we extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks, including cross-lingual multi-speaker voice cloning and cross-lingual multi-speaker speech editing. We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes given a speech example and its transcription. By learning to reconstruct the masked parts of the input in different languages, our model shows great improvements over speaker-embedding-based multi-speaker TTS methods. Moreover, our framework is end-to-end for both the training and the inference without any finetuning effort. In cross-lingual multi-speaker voice cloning and cross-lingual multi-speaker speech editing tasks, our experiments show that our model outperforms speaker-embedding-based multi-speaker TTS methods. The code and model are publicly available at PaddleSpeech.
translated by 谷歌翻译
由于非平稳性,现实世界多变量时间序列(MTS)的分布会随着时间而变化,称为分布漂移。大多数现有的MT预测模型都会极大地遭受分销漂移的影响,并随着时间的推移降低了预测性能。现有方法通过适应最新到达数据或根据未来数据得出的元知识进行自我纠正来解决分布漂移。尽管在MT的预测中取得了巨大的成功,但这些方法几乎无法捕获固有的分布变化,尤其是从分布的角度来看。因此,我们提出了一个新型的框架时间条件变化自动编码器(TCVAE),以对MTS中历史观察结果和未来数据之间的动态分布依赖性进行建模,并将依赖性作为时间条件分布推断为利用潜在变量。具体而言,新型的颞鹰注意机制代表了随后馈入馈送前网络的时间因素,以估计潜在变量的先前高斯分布。时间因素的表示进一步动态地调整了基于变压器的编码器和解码器的结构,以利用门控注意机制来变化。此外,我们引入条件连续归一化流量,以将先前的高斯转化为复杂且无形式的分布,以促进对时间条件分布的灵活推断。在六个现实世界MTS数据集上进行的广泛实验表明,与最先进的MTS预测基线相比,TCVAE的出色鲁棒性和有效性。我们进一步说明了TCVAE通过多方面的案例研究和现实情况下的可视化来说明TCVAE的适用性。
translated by 谷歌翻译
很少有课堂学习(FSCIL)着重于设计学习算法,这些学习算法可以不断地从几个样本中学习一系列新任务,而不会忘记旧任务。困难是,从新任务中进行一系列有限数据的培训会导致严重的过度拟合问题,并导致众所周知的灾难性遗忘问题。现有研究主要利用图像信息,例如存储以前任务的图像知识或限制分类器更新。但是,他们忽略了分析课堂标签的信息丰富且较少的嘈杂文本信息。在这项工作中,我们建议通过采用内存提示来利用标签文本信息。内存提示可以依次学习新数据,同时存储先前的知识。此外,为了优化内存提示而不破坏存储的知识,我们提出了基于刺激的训练策略。它根据图像嵌入刺激(即嵌入元素的分布)来优化内存提示。实验表明,我们提出的方法的表现优于所有先前的最新方法,从而大大减轻了灾难性的遗忘和过度拟合问题。
translated by 谷歌翻译
现实世界的视觉搜索系统涉及具有不同计算和存储资源的多个平台上的部署。部署适合最小符合平台的统一模型会导致精度有限。预计将部署具有不同能力的模型,以适应资源约束,这要求这些模型提取的功能必须在度量空间中对齐。实现特征比对的方法称为“兼容学习”。现有的研究主要集中在一对一兼容的范式上,该范式在多个模型之间学习兼容性受到限制。我们提出了一个具有自我兼容性(SFSC)的可切换表示学习框架。 SFSC通过一个训练过程生成一系列具有不同能力的兼容子模型。子模型的优化面对梯度冲突,我们从大小和方向的角度来减轻它。我们通过不确定性估计动态调整子模型的优先级,以适当地将子模型合作。此外,预计有相互矛盾的梯度以避免相互干扰。 SFSC在评估的数据集上实现了最先进的性能。
translated by 谷歌翻译
语言模型既展示了定量的改进,又展示了新的定性功能,随着规模的增加。尽管它们具有潜在的变革性影响,但这些新能力的特征却很差。为了为未来的研究提供信息,为破坏性的新模型能力做准备,并改善社会有害的效果,至关重要的是,我们必须了解目前和近乎未来的能力和语言模型的局限性。为了应对这一挑战,我们介绍了超越模仿游戏基准(Big Bench)。 Big Bench目前由204个任务组成,由132家机构的442位作者贡献。任务主题是多样的,从语言学,儿童发展,数学,常识性推理,生物学,物理学,社会偏见,软件开发等等。 Big-Bench专注于被认为超出当前语言模型的功能的任务。我们评估了OpenAI的GPT型号,Google内部密集变压器体系结构和大型基础上的开关稀疏变压器的行为,跨越了数百万到数十亿个参数。此外,一个人类专家评估者团队执行了所有任务,以提供强大的基准。研究结果包括:模型性能和校准都随规模改善,但绝对的术语(以及与评估者的性能相比);在模型类中的性能非常相似,尽管带有稀疏性。逐渐和预测的任务通常涉及大量知识或记忆成分,而在临界规模上表现出“突破性”行为的任务通常涉及多个步骤或组成部分或脆性指标;社交偏见通常会随着含糊不清的环境而随着规模而增加,但这可以通过提示来改善。
translated by 谷歌翻译
最近,语音表示学习改善了许多与语音有关的任务,例如语音识别,语音分类和语音到文本翻译。但是,以上所有任务都朝着语音理解的方向发展,但是对于反向方向,言语综合,由于产生高质量语音的挑战性质,代表性学习的潜力尚未实现。为了解决这个问题,我们提出了我们的框架,对准的声音文本预处理($^3 $ t),该框架在培训期间重建了带有文本输入和声学文本对齐的蒙面声信号。通过这种方式,预处理的模型可以生成高质量的重建频谱图,可以直接应用于语音编辑和看不见的扬声器tts。实验显示了$^3 $ t在语音编辑上的SOTA模型,并在没有外部说话者验证模型的情况下改善了多扬声器语音综合。
translated by 谷歌翻译
现有的分布式协作多智能体增强学习(MARL)框架通常假设通过共识算法估计全球奖励的无向协调图和通信图。这种框架可能导致昂贵的通信成本,并且由于全球共识的要求,可扩展性差。在这项工作中,我们使用定向协调图研究Marls,并提出了一种分布式RL算法,其中本地策略评估基于本地值函数。通过与其邻居通过定向的学习诱导的通信图来实现每个代理的本地值函数,而不使用任何共识算法。采用基于参数扰动的零顺序优化(动物园)方法来实现梯度估计。通过与现有的基于动物园的RL算法进行比较,我们表明我们提出的分布式RL算法可确保高可扩展性。示出了分布式资源分配示例来说明我们算法的有效性。
translated by 谷歌翻译
学习地区内部背景和区域间关系是加强点云分析的特征表示的两项有效策略。但是,在现有方法中没有完全强调的统一点云表示的两种策略。为此,我们提出了一种名为点关系感知网络(PRA-NET)的小说框架,其由区域内结构学习(ISL)模块和区域间关系学习(IRL)模块组成。ISL模块可以通过可差的区域分区方案和基于代表的基于点的策略自适应和有效地将本地结构信息动态地集成到点特征中,而IRL模块可自适应和有效地捕获区域间关系。在涵盖形状分类,关键点估计和部分分割的几个3D基准测试中的广泛实验已经验证了PRA-Net的有效性和泛化能力。代码将在https://github.com/xiwuchen/pra-net上获得。
translated by 谷歌翻译